P3 - Limpando dados do OpenStreetMap

UDACITY

por José Ferraz Neto

1. Introdução

O projeto tem por objetivo utilizar técnicas de análise de dados para limpar e posteriormente realizar uma análise exploratória de dados obtidos de uma única cidade na plataforma colaborativa OpenStreetMap.

Desse modo, escolhi a cidade de Brasília, capital federal, local onde vivo a pouco mais de 1 ano, para ser o objeto de estudo desse projeto.

2. Fonte dos Dados

Os dados foram coletados do site Mapzen. O arquivo comprimido possui 7.5MB, enquanto descomprimido totaliza 104.2MB. O arquivo pode ser baixado pelo seguinte link. Abaixo segue a região o qual terá os dados explorados.

In [1]:
import folium
print('A versão de Folium carrega é a',folium.__version__)
A versão de Folium carrega é a 0.2.1
In [2]:
bsb = folium.Map([-15.7880, -47.8468], zoom_start=11)
bsb
Out[2]:

3. Data Wrangling

In [3]:
import xml.etree.ElementTree as ET
from collections import Counter
import pandas as pd
osmfile = "C:\\Users\\jferraz\\Documents\\Pessoal\\Cursos\\Nanodegree\\Analista de Dados\\P3\\dados projeto\\brasilia_brazil.osm"

3.1 Consolidação dos dados brutos

3.1.1 Nodes

Iniciaremos explorando os atributos associados aos nodes dos dados. Em seguida construíremos um dicionário que receberá todas as informações relativas a cada node.

In [79]:
count_node = 0
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "node":
        if count_node > 5:
            break
        for tag in elem.iter():
            count_node += 1
            print(tag.attrib)
{'lat': '-15.7994386', 'uid': '389299', 'id': '34557099', 'timestamp': '2011-06-17T03:43:52Z', 'version': '12', 'changeset': '8461436', 'lon': '-47.8615793', 'user': 'Alexandremb'}
{'lat': '-15.8013371', 'uid': '389299', 'id': '34557100', 'timestamp': '2011-01-13T17:39:43Z', 'version': '11', 'changeset': '6959210', 'lon': '-47.86221', 'user': 'Alexandremb'}
{'lat': '-15.8011361', 'uid': '463504', 'id': '34557101', 'timestamp': '2014-03-05T21:30:11Z', 'version': '15', 'changeset': '20939056', 'lon': '-47.8628537', 'user': 'erickdeoliveiraleal'}
{'lat': '-15.8000302', 'uid': '463504', 'id': '34557102', 'timestamp': '2014-03-05T21:30:11Z', 'version': '15', 'changeset': '20939056', 'lon': '-47.8663951', 'user': 'erickdeoliveiraleal'}
{'lat': '-15.7997956', 'uid': '463504', 'id': '34557103', 'timestamp': '2014-03-05T21:30:11Z', 'version': '15', 'changeset': '20939056', 'lon': '-47.867146', 'user': 'erickdeoliveiraleal'}
{'lat': '-15.7994921', 'uid': '463504', 'id': '34557141', 'timestamp': '2014-03-05T21:30:12Z', 'version': '16', 'changeset': '20939056', 'lon': '-47.868118', 'user': 'erickdeoliveiraleal'}

A descrição do significado de cada atributo do node pode ser encontrada na página Wiki do OpenStreetMap. Em seguida, iremos consolidar os dados dos nodes em uma lista de dicionários.

In [5]:
node_list = list() #lista de dicionários
node_dict = {} #dicionário que receberá a informação de cada node
attrib_nodes = ('id' and 'changeset' and 'user' and 'uid' and 'version' and 'timestamp' and 'lon' and 'lat') #atributos dos nodes
In [6]:
#PROCESSO ITERATIVO QUE IRÁ PERCORRER OS NODES E CONSTRUIR DICIONÁRIOS COM AS INFORMAÇÕES RELATIVA A ELES E POSTERIORMENTE
#ADICIONÁ-LO A LISTA.
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "node":
        for tag in elem.iter():
            if len(node_dict) == 0 and attrib_nodes in tag.attrib:
                node_dict['id'] = tag.attrib['id']
                node_dict['lat'] = tag.attrib['lat']
                node_dict['lon'] = tag.attrib['lon']
                node_dict['user'] = tag.attrib['user']
                node_dict['uid'] = tag.attrib['uid']
                node_dict['version'] = tag.attrib['version']
                node_dict['changeset'] = tag.attrib['changeset']
                node_dict['timestamp'] = tag.attrib['timestamp']             
                node_list.append(node_dict)
            elif len(node_dict) != 0 and attrib_nodes in tag.attrib:
                node_dict = {}
                node_dict['id'] = tag.attrib['id']
                node_dict['lat'] = tag.attrib['lat']
                node_dict['lon'] = tag.attrib['lon']
                node_dict['user'] = tag.attrib['user']
                node_dict['uid'] = tag.attrib['uid']
                node_dict['version'] = tag.attrib['version']
                node_dict['changeset'] = tag.attrib['changeset']
                node_dict['timestamp'] = tag.attrib['timestamp']
                node_list.append(node_dict)
In [7]:
node_list[0:2] #AMOSTRA DOS DADOS CONTIDOS NA LISTA DE DICIONÁRIOS DE NODES
Out[7]:
[{'changeset': '8461436',
  'id': '34557099',
  'lat': '-15.7994386',
  'lon': '-47.8615793',
  'timestamp': '2011-06-17T03:43:52Z',
  'uid': '389299',
  'user': 'Alexandremb',
  'version': '12'},
 {'changeset': '6959210',
  'id': '34557100',
  'lat': '-15.8013371',
  'lon': '-47.86221',
  'timestamp': '2011-01-13T17:39:43Z',
  'uid': '389299',
  'user': 'Alexandremb',
  'version': '11'}]

3.1.2 Node's tags

Em seguida, podemos explorar os atributos associados as tags vinculados aos nodes. Em seguida construíremos um dicionário que receberá todas as informações relativas a cada tag.

In [4]:
count_tag = 0
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "node":
        for node in elem.iter('node'):
            for tag in elem.iter('tag'):
                if count_tag > 5:
                    break
                count_tag += 1
                print(tag.attrib)
{'k': 'highway', 'v': 'traffic_signals'}
{'k': 'crossing', 'v': 'traffic_signals'}
{'k': 'admin_level', 'v': '2'}
{'k': 'alt_name', 'v': 'RA I'}
{'k': 'capital', 'v': 'yes'}
{'k': 'capital_ISO3166-1', 'v': 'yes'}

Cada tag designa um atributo geográfico a uma característica apresentada por um node, way ou relation específico. Para maiores detalhes e exemplos de diferentes tags, podem ser consultados pelo seguinte link. De forma similiar, iremos consolidar os dados das tags dos nodes em uma lista de dicionário.

In [5]:
node_tags_dict = {} #dicionário com os dados individuais das tags do nodes
node_tags_list = list() #lista de dicionários

for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "node":
        for node in elem.iter('node'):
            for tag in elem.iter('tag'):
                if len(node_tags_dict) == 0:
                    node_tags_dict['key'] = tag.attrib['k']
                    node_tags_dict['value'] = tag.attrib['v']
                    node_tags_dict['node_id'] = node.attrib['id']
                    node_tags_list.append(node_tags_dict)
                    
                elif len (node_tags_dict) != 0:
                    node_tags_dict = {}
                    node_tags_dict['key'] = tag.attrib['k']
                    node_tags_dict['value'] = tag.attrib['v']
                    node_tags_dict['node_id'] = node.attrib['id']
                    node_tags_list.append(node_tags_dict)
In [6]:
node_tags_list[0:2] #amostra da lista de dicionários
Out[6]:
[{'key': 'highway', 'node_id': '34557144', 'value': 'traffic_signals'},
 {'key': 'crossing', 'node_id': '34557144', 'value': 'traffic_signals'}]

3.1.3 Ways

Por conseguinte, iremos explorar os atributos associados as ways. Para maiores detalhes sobre essas unidades, clique aqui. Em seguida construíremos um dicionário que receberá todas as informações relativas a cada way.

In [84]:
count_way = 0
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for tag in elem.iter('way'):
            if count_way > 5:
                break
            count_way += 1
            print(tag.attrib)
{'uid': '463504', 'timestamp': '2014-05-15T20:19:22Z', 'version': '30', 'changeset': '22360208', 'id': '5081572', 'user': 'erickdeoliveiraleal'}
{'uid': '463504', 'timestamp': '2014-09-12T05:26:04Z', 'version': '9', 'changeset': '25382004', 'id': '5081822', 'user': 'erickdeoliveiraleal'}
{'uid': '463504', 'timestamp': '2014-06-02T23:53:24Z', 'version': '28', 'changeset': '22704755', 'id': '8159668', 'user': 'erickdeoliveiraleal'}
{'uid': '463504', 'timestamp': '2014-09-12T05:24:08Z', 'version': '9', 'changeset': '25382004', 'id': '8504851', 'user': 'erickdeoliveiraleal'}
{'uid': '463504', 'timestamp': '2016-06-22T19:03:07Z', 'version': '33', 'changeset': '40212173', 'id': '10064569', 'user': 'erickdeoliveiraleal'}
{'uid': '463504', 'timestamp': '2016-06-18T14:53:12Z', 'version': '7', 'changeset': '40117540', 'id': '10064602', 'user': 'erickdeoliveiraleal'}
In [20]:
ways_list = list() #lista de dicionários com dados de todos os ways
ways_dict = {} #dicionários que receberam os dados relativos a cada way individualmente
attrib_ways = ('id' and 'changeset' and 'user' and 'uid' and 'version' and 'timestamp') #atributos vinculados as ways

for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for tag in elem.iter():
            if len(ways_dict) == 0 and attrib_ways in tag.attrib:
                ways_dict['id'] = tag.attrib['id']
                ways_dict['user'] = tag.attrib['user']
                ways_dict['uid'] = tag.attrib['uid']
                ways_dict['version'] = tag.attrib['version']
                ways_dict['changeset'] = tag.attrib['changeset']
                ways_dict['timestamp'] = tag.attrib['timestamp']             
                ways_list.append(ways_dict)
            elif len(ways_dict) != 0 and attrib_ways in tag.attrib:
                ways_dict = {}
                ways_dict['id'] = tag.attrib['id']
                ways_dict['user'] = tag.attrib['user']
                ways_dict['uid'] = tag.attrib['uid']
                ways_dict['version'] = tag.attrib['version']
                ways_dict['changeset'] = tag.attrib['changeset']
                ways_dict['timestamp'] = tag.attrib['timestamp']
                ways_list.append(ways_dict)
In [24]:
ways_list[0:2] #amostra da lista de dicionários relativas aos ways
Out[24]:
[{'changeset': '22360208',
  'id': '5081572',
  'timestamp': '2014-05-15T20:19:22Z',
  'uid': '463504',
  'user': 'erickdeoliveiraleal',
  'version': '30'},
 {'changeset': '25382004',
  'id': '5081822',
  'timestamp': '2014-09-12T05:26:04Z',
  'uid': '463504',
  'user': 'erickdeoliveiraleal',
  'version': '9'}]

3.1.4 Way's tags

Em seguida, vamos verificar os atributos associados as tags dos ways. Em seguida construíremos uma lista de dicionários que receberá todas as informações relativas a cada tag vinculada aos ways.

In [85]:
count_ways_tag = 0
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for way in elem.iter('way'):
            for tag in elem.iter('tag'):
                if count_ways_tag > 5:
                    break
                count_ways_tag += 1
                print(tag.attrib)             
{'k': 'name', 'v': 'S1'}
{'k': 'lanes', 'v': '3'}
{'k': 'oneway', 'v': 'yes'}
{'k': 'highway', 'v': 'trunk'}
{'k': 'surface', 'v': 'paved'}
{'k': 'maxspeed', 'v': '60'}
In [15]:
ways_tags_dict = {} #dicionário para receber os atributos das tags associados aos ways
ways_tags_list = list() #lista de dicionários

for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for way in elem.iter('way'):
            for tag in elem.iter('tag'):
                if len(ways_tags_dict) == 0:
                    if 'k' in tag.attrib:
                        ways_tags_dict['key'] = tag.attrib['k']
                    if 'v' in tag.attrib:
                        ways_tags_dict['value'] = tag.attrib['v']
                    ways_tags_dict['way_id'] = way.attrib['id']
                    ways_tags_list.append(ways_tags_dict)
                elif len (ways_tags_dict) != 0:
                    ways_tags_dict = {}
                    if 'k' in tag.attrib:
                        ways_tags_dict['key'] = tag.attrib['k']
                    if 'v' in tag.attrib:
                        ways_tags_dict['value'] = tag.attrib['v']
                    ways_tags_dict['way_id'] = way.attrib['id']
                    ways_tags_list.append(ways_tags_dict)
In [100]:
ways_tags_list[0:2] #uma amostra da lista de dicionário
Out[100]:
[{'key': 'name', 'value': 'S1', 'way_id': '5081572'},
 {'key': 'lanes', 'value': '3', 'way_id': '5081572'}]

3.1.5 Way's nodes

Por conseguinte, vamos verificar os atributos associados aos nodes dos ways. Posteriomente, construíremos uma lista de dicionários que receberá todas as informações relativas a cada node vinculado aos ways.

In [87]:
count_ways_nodes = 0
for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for way in elem.iter('way'):
            for nd in elem.iter('nd'):
                if count_ways_nodes > 5:
                    break
                count_ways_nodes += 1
                print(nd.attrib)      
{'ref': '123853635'}
{'ref': '123853652'}
{'ref': '654571442'}
{'ref': '111450110'}
{'ref': '123853363'}
{'ref': '34565323'}
In [74]:
ways_nodes_dict = {} #dicionário para receber os atributos das tags associados aos ways
ways_nodes_list = list() #lista de dicionários

for event, elem in ET.iterparse(osmfile, events=("start",)):
    if elem.tag == "way":
        for way in elem.iter('way'):
            for nd in elem.iter('nd'):
                if len(ways_nodes_dict) == 0:
                    ways_nodes_dict['id'] = way.attrib['id']
                    ways_nodes_dict['node_id'] = nd.attrib['ref']
                    ways_nodes_list.append(ways_nodes_dict)
                elif len (ways_nodes_dict) != 0:
                    ways_nodes_dict = {}
                    ways_nodes_dict['id'] = way.attrib['id']
                    ways_nodes_dict['node_id'] = nd.attrib['ref']
                    ways_nodes_list.append(ways_nodes_dict)
In [76]:
ways_nodes_list[0:10]
Out[76]:
[{'id': '5081572', 'node_id': '123853635'},
 {'id': '5081572', 'node_id': '123853652'},
 {'id': '5081572', 'node_id': '654571442'},
 {'id': '5081572', 'node_id': '111450110'},
 {'id': '5081572', 'node_id': '123853363'},
 {'id': '5081822', 'node_id': '34565323'},
 {'id': '5081822', 'node_id': '2849102620'},
 {'id': '5081822', 'node_id': '1553409712'},
 {'id': '5081822', 'node_id': '34557143'},
 {'id': '8159668', 'node_id': '123853252'}]

3.2 Análise Exploratória dos dados Brutos

3.2.1 Endereços

A cidade de Brasília foi uma cidade planejada e possui características únicas. Para os menos familiarizados, os endereços da cidade são um verdadeiro desafio para se compreender em um primeiro contato. Abaixo vamos verificar os mais diversos tipos e analisar se seguem um padrão.

In [9]:
addr_count = 0
for tag in node_tags_list:
    if tag['key'] == 'addr:street':
        if addr_count > 30:
            break
        addr_count += 1
        print(tag['value'])
CLN 408
CLN 408
CLN 408
SQN 312
Avenida das Castanheiras
Condomínio Ville de Montagne Quadra 18
Condomínio Ville de Montagne Quadra 18
Condomínio Ville de Montagne Quadra 18
Setor Policial Sul, Quadra 4, Lote 5
SRES Q 12 bloco M
SRES Q 12 bloco N
Condomínio Ville de Montagne Quadra 18
Rua 300
QS 7 R 400, 1 lt 1 - Areal
CLN 202
CLN 403
CLN 410
CLN 408
CLN 408
CLN 410
CLN 408
CLN 410
CLN 410
CLN 408
CLN 203
CLN 204
CLN 204
CLN 204
CLN 204
CLN 204
CLN 204

Na lista acima é possível verificar que há diversos endereços que usam abreviações para Conjunto, Quadra, Condomínio, Chácara e assim por diante. Assim, iremos padronizar diversos nomes que compõem o nome dos endereços.

In [11]:
#Criar um dicionário que irá receber um conjunto de chaves/valores que serão utilizados para correção dos endereços.

map_stnames = {'Conj.': 'Conjunto',
               'CONJUNTO': 'Conjunto',
               'Q.': 'Quadra',
               'QUADRA': 'Quadra',
               'Cln': 'CLN',
               'Qnm': 'QNM',
               'BLOCO': 'Bloco',
               'Cond.': 'Condomínio',
               'Qnl': 'QNL',
               'Scln': 'SCLN',
               'Cls': 'CLS',
               'Qn': 'QN',
               'Qr': 'QR',
               'Ch.': 'Chácara',
               'BL.': 'Bloco',
               'Av.': 'Avenida'
               }
for k,v in map_stnames.items():
    for i,tag in enumerate(node_tags_list):
        if tag['key'] == 'addr:street':
            if k in tag['value']:
                location = tag['value'].index(k) #determinar o index onde a key de map_stnames se encontra na string
                if location != 0: #caso a sub-string não esteja no início da string
                    comp_nrua = len(k)
                    inicio_string = tag['value'][:location]
                    final_string = tag['value'][(location+comp_nrua):]
                    node_tags_list[i]['value'] = inicio_string+v+final_string #substituição na lista que guarda o nome das ruas
                else: #caso a substring esteja no início da string
                    node_tags_list[i]['value'] = v+tag['value'][len(k):]  
In [12]:
#Verificar se após a padronização ainda existem endereços com "QUADRA", "Av.", "Cond.", por exemplo.
for tag in node_tags_list:
    if tag['key'] == 'addr:street':
        if 'QUADRA' in tag['value']:
            print(tag['value'])
        if 'Av.' in tag['value']:
            print(tag['value'])
        if 'Cond.' in tag['value']:
            print(tag['value'])

Pelo código acima, é possível inferir que a substituição foi bem sucedida.

3.2.2 CEP's

Analisar o padrão que os números dos CEP's estão representados no conjunto de dados.

In [ ]:
for tag1,tag2 in zip(node_tags_list,ways_tags_list):
    if tag1['key'] == 'addr:postcode':
        print(tag1['value'])
    if tag2['key'] == 'addr:postcode':
        print(tag2['value'])

Com os dados acima, pode-se verificar uma falta de padronização na representação do CEP, alguns utilizam hífen como separador dos três último algarismos, outro não. Alguns registros possuem sepador de milhar, enquanto outros omitem. Dessa forma é necessário uma padronização da representação.

In [16]:
#PADRONIZAR OS CEP's
for i,tag in enumerate(node_tags_list):
    if tag['key'] == 'addr:postcode':
        if "." in tag['value']:
            node_tags_list[i]['value'] = tag['value'].replace(".","") #retirada do ponto como separador de milhar
        elif "-" not in tag['value']:
            node_tags_list[i]['value'] = tag['value'][:5]+"-"+tag['value'][5:] #inseração do '-'

for i,tag in enumerate(ways_tags_list):
    if tag['key'] == 'addr:postcode':
        if "." in tag['value']:
            ways_tags_list[i]['value'] = tag['value'].replace(".","") #retirada do ponto como separador de milhar
        elif "-" not in tag['value']:
            ways_tags_list[i]['value'] = tag['value'][:5]+"-"+tag['value'][5:] #inseração do '-'
In [17]:
#Repetição do código para confirmar a padronização da representação númerica dos CEP's
count_cep = 0
for tag1,tag2 in zip(node_tags_list,ways_tags_list):
    if count_cep > 20:
        break
    if tag1['key'] == 'addr:postcode':
        count_cep += 1
        print(tag1['value'])
    if tag2['key'] == 'addr:postcode':
        count_cep += 1
        print(tag2['value'])
71680-357
71680-357
71680-357
70645-135
70645-145
71680-357
71961-540
71968-720
70865-550
70716-900
70830-404
70074-900
70307-902
70333-900
70046-900
70040-906
71503-502
70775-000
70910-900
71503-501
71503-501

3.3 Banco de dados

3.3.1 Preparação dos dados

Os dados acima trabalhados irão alimentar um banco de dados. Para isso, serão criados dataframes, que por sua vez irão ser utilizados para alimentar o banco de dados.

In [55]:
#CRIAR UM DATAFRAME PARA NODES
nodes = pd.DataFrame(columns=['id','lat','lon','user','uid','version','changeset','timestamp'])

nodes['id'] = [node['id'] for node in node_list]
nodes.lat = [node['lat'] for node in node_list]
nodes.lon = [node['lon'] for node in node_list]
nodes.user = [node['user'] for node in node_list]
nodes.uid = [node['uid'] for node in node_list]
nodes.version = [node['version'] for node in node_list]
nodes.changeset = [node['changeset'] for node in node_list]
nodes.timestamp = [node['timestamp'] for node in node_list]
In [69]:
nodes['timestamp'] = pd.to_datetime(nodes['timestamp']) #conversão para datetime
In [73]:
nodes.head()
Out[73]:
id lat lon user uid version changeset timestamp
0 34557099 -15.7994386 -47.8615793 Alexandremb 389299 12 8461436 2011-06-17 03:43:52
1 34557100 -15.8013371 -47.86221 Alexandremb 389299 11 6959210 2011-01-13 17:39:43
2 34557101 -15.8011361 -47.8628537 erickdeoliveiraleal 463504 15 20939056 2014-03-05 21:30:11
3 34557102 -15.8000302 -47.8663951 erickdeoliveiraleal 463504 15 20939056 2014-03-05 21:30:11
4 34557103 -15.7997956 -47.867146 erickdeoliveiraleal 463504 15 20939056 2014-03-05 21:30:11
In [58]:
#CRIAR UM DATAFRAME PARA NODE'S TAGS
node_tags = pd.DataFrame(columns=['key','value','node_id'])

node_tags.key = [node['key'] for node in node_tags_list]
node_tags.value = [node['value'] for node in node_tags_list]
node_tags.node_id = [node['node_id'] for node in node_tags_list]
In [63]:
node_tags.head()
Out[63]:
key value node_id
0 highway traffic_signals 34557144
1 crossing traffic_signals 34557144
2 admin_level 2 34567423
3 alt_name RA I 34567423
4 capital yes 34567423
In [59]:
#CRIAR UM DATAFRAME PARA WAYS
ways = pd.DataFrame(columns=['id','user','uid','version','changeset','timestamp'])

ways.id = [way['id'] for way in ways_list]
ways.user = [way['user'] for way in ways_list]
ways.uid = [way['uid'] for way in ways_list]
ways.version = [way['version'] for way in ways_list]
ways.changeset = [way['changeset'] for way in ways_list]
ways.timestamp = [way['timestamp'] for way in ways_list]
In [71]:
ways['timestamp'] = pd.to_datetime(ways['timestamp']) #conversão para datetime
In [72]:
ways.head()
Out[72]:
id user uid version changeset timestamp
0 5081572 erickdeoliveiraleal 463504 30 22360208 2014-05-15 20:19:22
1 5081822 erickdeoliveiraleal 463504 9 25382004 2014-09-12 05:26:04
2 8159668 erickdeoliveiraleal 463504 28 22704755 2014-06-02 23:53:24
3 8504851 erickdeoliveiraleal 463504 9 25382004 2014-09-12 05:24:08
4 10064569 erickdeoliveiraleal 463504 33 40212173 2016-06-22 19:03:07
In [103]:
#CRIAR UM DATAFRAME PARA WAY'S TAGS
ways_tags = pd.DataFrame(columns=['key','value','way_id'])
ways_tags.key = [tag['key'] for tag in ways_tags_list]
ways_tags.value = [tag['value'] for tag in ways_tags_list]
ways_tags.way_id = [tag['way_id'] for tag in ways_tags_list]
In [104]:
ways_tags.head()
Out[104]:
key value way_id
0 name S1 5081572
1 lanes 3 5081572
2 oneway yes 5081572
3 highway trunk 5081572
4 surface paved 5081572
In [88]:
#CRIAR UM DATAFRAME PARA WAY'S NODES
ways_nodes = pd.DataFrame(columns=['id','node_id'])
ways_nodes['id'] = [node['id'] for node in ways_nodes_list]
ways_nodes.node_id = [node['node_id'] for node in ways_nodes_list]
In [89]:
ways_nodes.head()
Out[89]:
id node_id
0 5081572 123853635
1 5081572 123853652
2 5081572 654571442
3 5081572 111450110
4 5081572 123853363

3.3.2 Alimentação do banco

In [4]:
import sqlite3
conn = sqlite3.connect('ud_bsb.db')
c = conn.cursor()
In [67]:
# Criação da tabela NODES no banco
c.execute('''CREATE TABLE nodes
             (id INTEGER PRIMARY KEY, 
             lat REAL, 
             lon REAL,
             user TEXT,
             uid INTEGER,
             version INTEGER,
             changeset INTEGER,
             timestamp TEXT
             )''')
Out[67]:
<sqlite3.Cursor at 0x40adddc0>
In [74]:
#Alimentação da tabela nodes do bd com o dataframe correspondente;
nodes.to_sql('nodes', conn, if_exists='append', index=False)
In [75]:
# Criação da tabela NODE TAGS no banco
c.execute('''CREATE TABLE nodes_tags
             (id INTEGER PRIMARY KEY AUTOINCREMENT, 
             key TEXT, 
             value TEXT,
             node_id INTEGER,
             FOREIGN KEY(node_id) REFERENCES nodes(id)
             )''')
Out[75]:
<sqlite3.Cursor at 0x40adddc0>
In [76]:
#Alimentação da tabela nodes_tags do bd com o dataframe correspondente;
node_tags.to_sql('nodes_tags', conn, if_exists='append', index=False)
In [77]:
# Criação da tabela WAYS no banco
c.execute('''CREATE TABLE ways
             (id INTEGER PRIMARY KEY, 
             user TEXT, 
             uid INTEGER,
             version INTEGER,
             changeset INTEGER,
             timestamp TEXT
             
             )''')
Out[77]:
<sqlite3.Cursor at 0x40adddc0>
In [78]:
#Alimentação da tabela ways do bd com o dataframe correspondente;
ways.to_sql('ways', conn, if_exists='append', index=False)
In [108]:
# CRIAR TABELA WAYS_TAGS no banco
c.execute('''CREATE TABLE ways_tags (
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    way_id INTEGER NOT NULL,
    FOREIGN KEY (way_id) REFERENCES ways(id)
);''')
Out[108]:
<sqlite3.Cursor at 0x4ea1650>
In [109]:
#Alimentação da tabela ways_tags do bd com o dataframe correspondente;
ways_tags.to_sql('ways_tags', conn, if_exists='append', index=False)
In [90]:
# CRIAR TABELA WAYS_NODES no banco
c.execute('''CREATE TABLE ways_nodes (
    id INTEGER NOT NULL,
    node_id INTEGER NOT NULL,
    FOREIGN KEY (id) REFERENCES ways(id),
    FOREIGN KEY (node_id) REFERENCES nodes(id)
);''')
Out[90]:
<sqlite3.Cursor at 0x4ea1650>
In [91]:
#Alimentação da tabela ways_nodes do bd com o dataframe correspondente;
ways_nodes.to_sql('ways_nodes', conn, if_exists='append', index=False)
In [110]:
#COMMIT DAS ALTERAÇÕES
conn.commit()

3.4 Análise exploratória dos dados consolidados

A capital federal, Brasília, é sede de vários órgãos públicos federais, além disso, é conhecida por ser o centro político do país. Em seguida, iremos levantar a localização dos órgãos públicos que estão mapeados na plataforma do OpenStreetMap.

3.4.1 LOCALIZAÇÃO DE PRÉDIOS GOVERNAMENTAIS

In [5]:
#Aquisição dos atributos dos nodes que são ligadas a instalações prediais governamentais (KEY = OFFICE E VALUE = GOVERNMENT)
gov_data_loc = c.execute('''select key,value,nodes.lat,nodes.lon from nodes_tags 
join nodes on nodes_tags.node_id = nodes.id where node_id in (SELECT nodes.id from nodes_tags join nodes 
on nodes_tags.node_id = nodes.id where nodes_tags.key = 'office' and nodes_tags.value = 'government')''')
In [6]:
df_gov_loc = pd.DataFrame(gov_data_loc.fetchall(),columns=['key','value','lat','lon']) #carregar em um dataframe
df_gov_loc.head()
Out[6]:
key value lat lon
0 name ASTEG - Associação de transporte Escolar do Gama -16.029249 -48.063298
1 office government -16.029249 -48.063298
2 building civic -16.029249 -48.063298
3 name Administração do Gama -16.015962 -48.064706
4 office government -16.015962 -48.064706
In [7]:
#DETERMINAR TODOS OS NODES ASSOCIADOS A WAYS QUE POSSUEM TAGS COM UMA KEY = OFFICE E VALUE GOVERNMENT
ways_gov_loc = c.execute('''SELECT key,value, nodes.lat, nodes.lon from ways_tags JOIN ways_nodes ON 
ways_tags.way_id = ways_nodes.id JOIN nodes ON ways_nodes.node_id = nodes.id WHERE nodes.lat in 
(SELECT nodes.lat from ways_tags JOIN ways_nodes ON ways_tags.way_id = ways_nodes.id JOIN nodes ON 
ways_nodes.node_id = nodes.id WHERE ways_tags.value = 'government') AND nodes.lon in 
(SELECT nodes.lon from ways_tags JOIN ways_nodes ON ways_tags.way_id = ways_nodes.id JOIN 
nodes ON ways_nodes.node_id = nodes.id WHERE ways_tags.value = 'government')''')
In [8]:
#ARMAZENAR A QUERY EM UM DATAFRAME
way_gov_loc = pd.DataFrame(ways_gov_loc.fetchall(),columns=['key','value','lat','lon']) #carregar em um dataframe
way_gov_loc.head()
Out[8]:
key value lat lon
0 building yes -15.799045 -47.861343
1 building:levels 4 -15.799045 -47.861343
2 height 18 -15.799045 -47.861343
3 name Palácio do Planalto -15.799045 -47.861343
4 office government -15.799045 -47.861343
In [9]:
#CRIAR UM DATAFRAME PARA ARMAZENAR OS DADOS DE LOCALIZAÇÃO DOS ÓRGÃOES PÚBLICOS ARMAZENADOS EM df_gov_loc e way_gov_loc
loc_gov_ways_nodes = pd.DataFrame(columns=['nome','lat','lon']) #carregar em um dataframe
In [10]:
index_count_loc_gov = 0
for index,row in way_gov_loc.iterrows():
    if row['key'] == 'name':
        if row['value'] not in loc_gov_ways_nodes['nome'].values:
            index_count_loc_gov += 1
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'nome',row['value'])
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lat',row['lat'])
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lon',row['lon'])
In [11]:
#AO TODO FORAM OBTIDOS A LOCALIZAÇÃO DE 107 ÓRGÃOS PÚBLICOS A PARTIR DOS DADOS ARMAZENADOS EM way_gov_loc
loc_gov_ways_nodes.shape
Out[11]:
(107, 3)
In [12]:
for index,row in df_gov_loc.iterrows():
    if row['key'] == 'name':
        if row['value'] not in loc_gov_ways_nodes['nome'].values:
            index_count_loc_gov += 1
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'nome',row['value'])
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lat',row['lat'])
            loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lon',row['lon'])
In [13]:
#AO TODO FORAM OBTIDOS A LOCALIZAÇÃO DE 131 ÓRGÃOS PÚBLICOS A PARTIR DOS DADOS ARMAZENADOS EM df_gov_loc
loc_gov_ways_nodes.shape
Out[13]:
(237, 3)
In [14]:
map_gov = folium.Map([-15.7880, -47.8468], zoom_start=12)
for index,row in loc_gov_ways_nodes.iterrows():
    folium.Marker([row['lat'], row['lon']], popup=row['nome'].encode("latin-1")).add_to(map_gov)
In [15]:
map_gov
Out[15]:

Por meio de zoom na região da esplanada dos ministérios, é possível verificar que vários dos órgãos ali sediados não possuem indicação no mapa.

In [16]:
esplanada = folium.Map([-15.7996,-47.8748],zoom_start=15)
for index,row in loc_gov_ways_nodes.iterrows():
    folium.Marker([row['lat'], row['lon']], popup=row['nome'].encode("latin-1")).add_to(esplanada)
In [17]:
esplanada
Out[17]:

A página wiki do OpenStreetmap recomenada que para escritórios governamentais seja utilizada a tag office=government. Como explica o texto abaixo.

The tag office=government is used to tag government offices of a (supra)national, regional or local government agency or department. In these offices, staff work directly for the government and carry out tasks to administer facilities, operate registries and licensing bureaus, regulate lands and/or people, etc.

Infelizmente, não é isso que ocorre em todos os registros adicionados ao site, como bem mostra os extrato do código xml da cidade de Brasília.

Ministério dos Transportes


<node id="2897250180" lat="-15.7973305" lon="-47.8677576" version="2" timestamp="2014-10-25T01:27:41Z" changeset="26312704" uid="360183" user="wille"> <tag k="ref" v="10"/> <tag k="name" v="Ministério dos Transportes"/> <tag k="phone" v="+55 61 40039846"/> <tag k="amenity" v="bicycle_rental"/> <tag k="website" v="http://www.bikebrasilia.com"/> <tag k="capacity" v="12"/> <tag k="operator" v="Bike Brasília"/> <tag k="opening_hours" v="Mo-Su 06:00-24:00"/>

Ministério da Cultura / Meio Ambiente


<way id="46114584" version="8" timestamp="2015-11-30T20:05:06Z" changeset="35673450" uid="3147342" user="jadson_reis"> <nd ref="587938477"/> <nd ref="587938480"/> <nd ref="3869733864"/> <nd ref="587938476"/> <nd ref="587938477"/> <tag k="name" v="Ministério da Cultura / Meio Ambiente"/> <tag k="building" v="yes"/> <tag k="addr:street" v="Eixo Monumental"/> <tag k="addr:housename" v="Bloco B"/> <tag k="building:levels" v="11"/>

In [18]:
orgp_nodetag_extra = c.execute('''SELECT nodes_tags.value, nodes.lat, nodes.lon from nodes_tags 
                               JOIN nodes ON nodes.id = nodes_tags.node_id where value like '%Ministério%' ''')
In [19]:
df_extra_opb_ntag = pd.DataFrame(orgp_nodetag_extra.fetchall(),columns=['value','lat','lon']) #carregar em um dataframe
In [20]:
df_extra_opb_ntag
Out[20]:
value lat lon
0 Esplanada dos Ministérios -15.797959 -47.869797
1 Ministério da Fazenda -15.800012 -47.878193
2 Ministério das Cidades -15.799835 -47.878688
3 Procuradoria Geral da República 1a regiao - Mi... -15.804587 -47.882245
4 Ministério Público -15.616944 -47.653333
5 Ministério da Defesa -15.796061 -47.871844
6 Ministério da Cultura -15.798410 -47.872430
7 Ministério dos Transportes -15.797330 -47.867758
8 Ministério Vida em Deus -15.811056 -48.120145
9 Ministério do Esporte -15.756377 -47.892926
10 MDS - Ministério do Desenvolvimento Social e C... -15.743255 -47.897169
11 Ministério da Saúde -15.799330 -47.869262
12 Ministério do Planejamento -15.795106 -47.874592
13 Ministério da Cultura (Anexo) -15.794987 -47.893417
14 Ministério da Integração Nacional -15.770275 -47.893103
In [21]:
df_extra_opb_ntag.drop(df_extra_opb_ntag.index[8], inplace=True) #EXCLUSÃO DA ENTRADA REFERENTE A INSTALAÇÃO PREDIAL RELIGIOSA
In [22]:
orgp_waytag_extra = c.execute('''SELECT value, nodes.lat, nodes.lon from ways_tags JOIN ways_nodes ON 
ways_tags.way_id = ways_nodes.id JOIN nodes ON ways_nodes.node_id = nodes.id where value like '%Ministério%' ''')
In [23]:
df_extra_opb_wtag = pd.DataFrame(orgp_waytag_extra.fetchall(),columns=['value','lat','lon']) #carregar em um dataframe
In [24]:
df_extra_opb_wtag.head() #como um way é formado pela combinação de vários nodes, é natural haver múltiplos registros.
Out[24]:
value lat lon
0 Ministério do Desenvolvimento Agrário / Esport... -15.798323 -47.873180
1 Ministério do Desenvolvimento Agrário / Esport... -15.798276 -47.873332
2 Ministério do Desenvolvimento Agrário / Esport... -15.798516 -47.873425
3 Ministério do Desenvolvimento Agrário / Esport... -15.798905 -47.873550
4 Ministério do Desenvolvimento Agrário / Esport... -15.799162 -47.873635
In [25]:
for index,row in df_extra_opb_ntag.iterrows():
    if row['value'] not in loc_gov_ways_nodes['nome'].values:
        index_count_loc_gov += 1
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'nome',row['value'])
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lat',row['lat'])
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lon',row['lon'])
In [26]:
#AO TODO FORAM OBTIDOS A LOCALIZAÇÃO DE 8 ÓRGÃOS PÚBLICOS A PARTIR DOS DADOS ARMAZENADOS EM df_extra_opb_ntag
loc_gov_ways_nodes.shape
Out[26]:
(244, 3)
In [27]:
for index,row in df_extra_opb_wtag.iterrows():
    if row['value'] not in loc_gov_ways_nodes['nome'].values:
        index_count_loc_gov += 1
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'nome',row['value'])
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lat',row['lat'])
        loc_gov_ways_nodes.set_value(index_count_loc_gov, 'lon',row['lon'])
In [28]:
#AO TODO FORAM OBTIDOS A LOCALIZAÇÃO DE 24 ÓRGÃOS PÚBLICOS A PARTIR DOS DADOS ARMAZENADOS EM df_extra_opb_wtag
loc_gov_ways_nodes.shape
Out[28]:
(267, 3)
In [29]:
loc_gov_ways_nodes.head()
Out[29]:
nome lat lon
1 Palácio do Planalto -15.799 -47.8613
2 Anexo do Ministério da Educação -15.7937 -47.8729
3 Anexo do Ministério da Aeronáutica -15.794 -47.8721
4 Procuradoria-Geral do Distrito Federal -15.7828 -47.9065
5 Arquivo Central TJDFT -15.7598 -47.9309
In [30]:
mapa_bsb_orgaos = folium.Map([-15.7880, -47.8468], zoom_start=12)
for index,row in loc_gov_ways_nodes.iterrows():
    folium.Marker([row['lat'], row['lon']], popup=row['nome'].encode("latin-1")).add_to(mapa_bsb_orgaos)

Por meio do mapa, abaixo foi possível verificar que a maioria dos Ministérios localizados na esplanada agora possuem marcadores associados a sua localização.

In [31]:
mapa_bsb_orgaos
Out[31]:

3.4.2 Análise dos Colaboradores e seus Hábitos

Os dados de registros do nodes e ways possuem informações sobre horário e dia que as informações foram adicionadas a platafroma do OpenstreetMap. Dessa maneira, é possível explorar essas informações em busca de padrões de comportamento dos usuários, como os horários mais frequentes de uso, bem como os dias e/ou meses do ano.

3.4.2.1 O horários mais frequentes de colaboração

3.4.2.1.1 Dados dos Nodes

In [57]:
nodehour00 = c.execute('''SELECT user, count(user) as user FROM nodes 
WHERE strftime('%H',timestamp) >= '00' AND strftime('%H',timestamp) < '01' group by user order by user desc;''')
df_nodeh00 = pd.DataFrame(nodehour00.fetchall(),columns=['user','count'])
In [58]:
nodehour01 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '01' AND strftime('%H',timestamp) < '02' group by user order by user desc;''')
df_nodeh01 = pd.DataFrame(nodehour01.fetchall(),columns=['user','count'])
In [59]:
nodehour02 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '02' AND strftime('%H',timestamp) < '03' group by user order by user desc;''')
df_nodeh02 = pd.DataFrame(nodehour02.fetchall(),columns=['user','count'])
In [60]:
nodehour03 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '03' AND strftime('%H',timestamp) < '04' group by user order by user desc;''')
df_nodeh03 = pd.DataFrame(nodehour03.fetchall(),columns=['user','count'])
In [61]:
nodehour04 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '04' AND strftime('%H',timestamp) < '05' group by user order by user desc;''')
df_nodeh04 = pd.DataFrame(nodehour04.fetchall(),columns=['user','count'])
In [62]:
nodehour05 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '05' AND strftime('%H',timestamp) < '06' group by user order by user desc;''')
df_nodeh05 = pd.DataFrame(nodehour05.fetchall(),columns=['user','count'])
In [63]:
nodehour06 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '06' AND strftime('%H',timestamp) < '07' group by user order by user desc;''')
df_nodeh06 = pd.DataFrame(nodehour06.fetchall(),columns=['user','count'])
In [64]:
nodehour07 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '07' AND strftime('%H',timestamp) < '08' group by user order by user desc;''')
df_nodeh07 = pd.DataFrame(nodehour07.fetchall(),columns=['user','count'])
In [65]:
nodehour08 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '08' AND strftime('%H',timestamp) < '09' group by user order by user desc;''')
df_nodeh08 = pd.DataFrame(nodehour08.fetchall(),columns=['user','count'])
In [66]:
nodehour09 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '09' AND strftime('%H',timestamp) < '10' group by user order by user desc;''')
df_nodeh09 = pd.DataFrame(nodehour09.fetchall(),columns=['user','count'])
In [67]:
nodehour10 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '10' AND strftime('%H',timestamp) < '11' group by user order by user desc;''')
df_nodeh10 = pd.DataFrame(nodehour10.fetchall(),columns=['user','count'])
In [68]:
nodehour11 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '11' AND strftime('%H',timestamp) < '12' group by user order by user desc;''')
df_nodeh11 = pd.DataFrame(nodehour11.fetchall(),columns=['user','count'])
In [69]:
nodehour12 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '12' AND strftime('%H',timestamp) < '13' group by user order by user desc;''')
df_nodeh12 = pd.DataFrame(nodehour12.fetchall(),columns=['user','count'])
In [70]:
nodehour13 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '13' AND strftime('%H',timestamp) < '14' group by user order by user desc;''')
df_nodeh13 = pd.DataFrame(nodehour13.fetchall(),columns=['user','count'])
In [71]:
nodehour14 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '14' AND strftime('%H',timestamp) < '15' group by user order by user desc;''')
df_nodeh14 = pd.DataFrame(nodehour14.fetchall(),columns=['user','count'])
In [72]:
nodehour15 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '15' AND strftime('%H',timestamp) < '16' group by user order by user desc;''')
df_nodeh15 = pd.DataFrame(nodehour15.fetchall(),columns=['user','count'])
In [73]:
nodehour16 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '16' AND strftime('%H',timestamp) < '17' group by user order by user desc;''')
df_nodeh16 = pd.DataFrame(nodehour16.fetchall(),columns=['user','count'])
In [74]:
nodehour17 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '17' AND strftime('%H',timestamp) < '18' group by user order by user desc;''')
df_nodeh17 = pd.DataFrame(nodehour17.fetchall(),columns=['user','count'])
In [75]:
nodehour18 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '18' AND strftime('%H',timestamp) < '19' group by user order by user desc;''')
df_nodeh18 = pd.DataFrame(nodehour18.fetchall(),columns=['user','count'])
In [76]:
nodehour19 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '19' AND strftime('%H',timestamp) < '20' group by user order by user desc;''')
df_nodeh19 = pd.DataFrame(nodehour19.fetchall(),columns=['user','count'])
In [77]:
nodehour20 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '20' AND strftime('%H',timestamp) < '21' group by user order by user desc;''')
df_nodeh20 = pd.DataFrame(nodehour20.fetchall(),columns=['user','count'])
In [78]:
nodehour21 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '21' AND strftime('%H',timestamp) < '22' group by user order by user desc;''')
df_nodeh21 = pd.DataFrame(nodehour21.fetchall(),columns=['user','count'])
In [79]:
nodehour22 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '22' AND strftime('%H',timestamp) < '23' group by user order by user desc;''')
df_nodeh22 = pd.DataFrame(nodehour22.fetchall(),columns=['user','count'])
In [80]:
nodehour23 = c.execute('''SELECT user, count(user) as user FROM nodes  
WHERE strftime('%H',timestamp) >= '23' AND strftime('%H',timestamp) < '24' group by user order by user desc;''')
df_nodeh23 = pd.DataFrame(nodehour23.fetchall(),columns=['user','count'])

3.4.2.1.2 Dados dos Ways

In [81]:
wayhour00 = c.execute('''SELECT user, count(user) as user FROM ways 
WHERE strftime('%H',timestamp) >= '00' AND strftime('%H',timestamp) < '01' group by user order by user desc;''')
df_wayh00 = pd.DataFrame(nodehour00.fetchall(),columns=['user','count'])

wayhour01 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '01' AND strftime('%H',timestamp) < '02' group by user order by user desc;''')
df_wayh01 = pd.DataFrame(wayhour01.fetchall(),columns=['user','count'])
In [82]:
wayhour02 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '02' AND strftime('%H',timestamp) < '03' group by user order by user desc;''')
df_wayh02 = pd.DataFrame(wayhour02.fetchall(),columns=['user','count'])

wayhour03 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '03' AND strftime('%H',timestamp) < '04' group by user order by user desc;''')
df_wayh03 = pd.DataFrame(wayhour03.fetchall(),columns=['user','count'])

wayhour04 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '04' AND strftime('%H',timestamp) < '05' group by user order by user desc;''')
df_wayh04 = pd.DataFrame(wayhour04.fetchall(),columns=['user','count'])

wayhour05 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '05' AND strftime('%H',timestamp) < '06' group by user order by user desc;''')
df_wayh05 = pd.DataFrame(wayhour05.fetchall(),columns=['user','count'])

wayhour06 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '06' AND strftime('%H',timestamp) < '07' group by user order by user desc;''')
df_wayh06 = pd.DataFrame(wayhour06.fetchall(),columns=['user','count'])

wayhour07 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '07' AND strftime('%H',timestamp) < '08' group by user order by user desc;''')
df_wayh07 = pd.DataFrame(wayhour07.fetchall(),columns=['user','count'])

wayhour08 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '08' AND strftime('%H',timestamp) < '09' group by user order by user desc;''')
df_wayh08 = pd.DataFrame(wayhour08.fetchall(),columns=['user','count'])

wayhour09 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '09' AND strftime('%H',timestamp) < '10' group by user order by user desc;''')
df_wayh09 = pd.DataFrame(wayhour09.fetchall(),columns=['user','count'])

wayhour10 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '10' AND strftime('%H',timestamp) < '11' group by user order by user desc;''')
df_wayh10 = pd.DataFrame(wayhour10.fetchall(),columns=['user','count'])

wayhour11 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '11' AND strftime('%H',timestamp) < '12' group by user order by user desc;''')
df_wayh11 = pd.DataFrame(wayhour11.fetchall(),columns=['user','count'])

wayhour12 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '12' AND strftime('%H',timestamp) < '13' group by user order by user desc;''')
df_wayh12 = pd.DataFrame(wayhour12.fetchall(),columns=['user','count'])

wayhour13 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '13' AND strftime('%H',timestamp) < '14' group by user order by user desc;''')
df_wayh13 = pd.DataFrame(wayhour13.fetchall(),columns=['user','count'])

wayhour14 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '14' AND strftime('%H',timestamp) < '15' group by user order by user desc;''')
df_wayh14 = pd.DataFrame(wayhour14.fetchall(),columns=['user','count'])

wayhour15 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '15' AND strftime('%H',timestamp) < '16' group by user order by user desc;''')
df_wayh15 = pd.DataFrame(wayhour15.fetchall(),columns=['user','count'])

wayhour16 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '16' AND strftime('%H',timestamp) < '17' group by user order by user desc;''')
df_wayh16 = pd.DataFrame(wayhour16.fetchall(),columns=['user','count'])

wayhour17 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '17' AND strftime('%H',timestamp) < '18' group by user order by user desc;''')
df_wayh17 = pd.DataFrame(wayhour17.fetchall(),columns=['user','count'])

wayhour18 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '18' AND strftime('%H',timestamp) < '19' group by user order by user desc;''')
df_wayh18 = pd.DataFrame(wayhour18.fetchall(),columns=['user','count'])

wayhour19 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '19' AND strftime('%H',timestamp) < '20' group by user order by user desc;''')
df_wayh19 = pd.DataFrame(wayhour19.fetchall(),columns=['user','count'])

wayhour20 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '20' AND strftime('%H',timestamp) < '21' group by user order by user desc;''')
df_wayh20 = pd.DataFrame(wayhour20.fetchall(),columns=['user','count'])

wayhour21 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '21' AND strftime('%H',timestamp) < '22' group by user order by user desc;''')
df_wayh21 = pd.DataFrame(wayhour21.fetchall(),columns=['user','count'])

wayhour22 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '22' AND strftime('%H',timestamp) < '23' group by user order by user desc;''')
df_wayh22 = pd.DataFrame(wayhour22.fetchall(),columns=['user','count'])

wayhour23 = c.execute('''SELECT user, count(user) as user FROM ways  
WHERE strftime('%H',timestamp) >= '23' AND strftime('%H',timestamp) < '24' group by user order by user desc;''')
df_wayh23 = pd.DataFrame(wayhour23.fetchall(),columns=['user','count'])
In [83]:
#COMPILAR O TOTAL DAS CONSTRIBUIÇÕES REALIZADAS EM NODES E WAYS EM UM DATAFRAME

str_hrs = ['00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00','08:00','09:00','10:00','11:00','12:00',
          '13:00','14:00','15:00','16:00','17:00','18:00','19:00','20:00','21:00','22:00','23:00']
count_hrs = [df_nodeh00['count'].sum()+df_wayh00['count'].sum(),df_nodeh01['count'].sum()+df_wayh01['count'].sum(),
df_nodeh02['count'].sum()+df_wayh02['count'].sum(),df_nodeh03['count'].sum()+df_wayh03['count'].sum(),
df_nodeh04['count'].sum()+df_wayh04['count'].sum(),df_nodeh05['count'].sum()+df_wayh05['count'].sum(),
df_nodeh06['count'].sum()+df_wayh06['count'].sum(),df_nodeh07['count'].sum()+df_wayh07['count'].sum(),
df_nodeh08['count'].sum()+df_wayh08['count'].sum(),df_nodeh09['count'].sum()+df_wayh09['count'].sum(),
df_nodeh10['count'].sum()+df_wayh10['count'].sum(),df_nodeh11['count'].sum()+df_wayh11['count'].sum(),
df_nodeh12['count'].sum()+df_wayh12['count'].sum(),df_nodeh13['count'].sum()+df_wayh13['count'].sum(),
df_nodeh14['count'].sum()+df_wayh14['count'].sum(),df_nodeh15['count'].sum()+df_wayh15['count'].sum(),
df_nodeh16['count'].sum()+df_wayh16['count'].sum(),df_nodeh17['count'].sum()+df_wayh17['count'].sum(),
df_nodeh18['count'].sum()+df_wayh18['count'].sum(),df_nodeh19['count'].sum()+df_wayh19['count'].sum(),
df_nodeh20['count'].sum()+df_wayh20['count'].sum(),df_nodeh21['count'].sum()+df_wayh21['count'].sum(),
df_nodeh22['count'].sum()+df_wayh22['count'].sum(),df_nodeh23['count'].sum()+df_wayh23['count'].sum()]
df_count_hrs = pd.DataFrame(count_hrs, columns=['total'],index=str_hrs)

Pelo gráfico a seguir é possível, verificar o intervalo de 2 até as 9 da manhã é quando as pessoas menos colaboram em Brasília com a plataforma do OpenstreetMap. Entre meio-dia (12:00) e as 20 horas encontra-se a maior parte das constribuições do usuários, aproximadamente 55% delas. Como mostra o cálculo abaixo:

In [84]:
#Cálculo do percentual das contribuições entre 12 e 20 horas.
df_count_hrs['12:00' : "20:00"].sum()/df_count_hrs.sum()
Out[84]:
total    0.553047
dtype: float64
In [85]:
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline
sns.set_context("talk", rc={"lines.linewidth": 2})
user_hour = sns.barplot(x=df_count_hrs.index, y="total", data=df_count_hrs, palette="Blues_d")
plt.xticks(rotation=45)
user_hour.set_ylabel("Número de Constribuições")
Out[85]:
<matplotlib.text.Text at 0xe1bdcf8>

3.4.2.2 As datas mais frequentes de colaboração

Depois de constastar os horários de mais frequentes de uso da plataforma pelo usuários, agora veremos se existem padrões em relações ao meses do ano. Iremos começar determinando as datas mais antiga e a mais recente dos dados que estamos analisando.

In [86]:
# data mais antiga e e mais recente nos dados dos nodes
c.execute("SELECT min(timestamp), max(timestamp) from nodes;")
c.fetchall()
Out[86]:
[('2007-10-21 19:47:06', '2016-11-13 13:30:44')]
In [87]:
# data mais antiga e e mais recente nos dados dos ways
c.execute("SELECT min(timestamp), max(timestamp) from ways;")
c.fetchall()
Out[87]:
[('2008-12-22 07:51:07', '2016-11-13 13:43:05')]

3.4.2.2.1 Dados node's e datas

In [88]:
#2007
jan_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
fev_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
marc_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
abr_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
mai_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
jun_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
jul_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
ago_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
set_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
out_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
nov_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
dez_2007 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]

months_2007 = pd.DataFrame([jan_2007,fev_2007,marc_2007,abr_2007,mai_2007,jun_2007,jul_2007,
                           ago_2007,set_2007,out_2007,nov_2007,dez_2007], columns=['Total'])
In [89]:
#2008
jan_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
fev_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
marc_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
abr_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
mai_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
jun_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
jul_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
ago_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
set_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
out_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
nov_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
dez_2008 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
months_2008 = pd.DataFrame([jan_2008,fev_2008,marc_2008,abr_2008,mai_2008,jun_2008,jul_2008,
                           ago_2008,set_2008,out_2008,nov_2008,dez_2008], columns=['Total'])
In [90]:
#2009
jan_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
fev_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
marc_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
abr_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
mai_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
jun_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
jul_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
ago_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
set_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
out_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
nov_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
dez_2009 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
months_2009 = pd.DataFrame([jan_2009,fev_2009,marc_2009,abr_2009,mai_2009,jun_2009,jul_2009,
                           ago_2009,set_2009,out_2009,nov_2009,dez_2009], columns=['Total'])
In [91]:
#2010
jan_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
fev_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
marc_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
abr_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
mai_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
jun_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
jul_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
ago_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
set_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
out_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
nov_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
dez_2010 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
months_2010 = pd.DataFrame([jan_2010,fev_2010,marc_2010,abr_2010,mai_2010,jun_2010,jul_2010,
                           ago_2010,set_2010,out_2010,nov_2010,dez_2010], columns=['Total'])
In [92]:
#2011
jan_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
fev_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
marc_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
abr_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
mai_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
jun_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
jul_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
ago_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
set_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
out_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
nov_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
dez_2011 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
months_2011 = pd.DataFrame([jan_2011,fev_2011,marc_2011,abr_2011,mai_2011,jun_2011,jul_2011,
                           ago_2011,set_2011,out_2011,nov_2011,dez_2011], columns=['Total'])
In [93]:
#2012
jan_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
fev_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
marc_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
abr_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
mai_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
jun_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
jul_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
ago_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
set_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
out_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
nov_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
dez_2012 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
months_2012 = pd.DataFrame([jan_2012,fev_2012,marc_2012,abr_2012,mai_2012,jun_2012,jul_2012,
                           ago_2012,set_2012,out_2012,nov_2012,dez_2012], columns=['Total'])
In [94]:
#2013
jan_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
fev_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
marc_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
abr_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
mai_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
jun_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
jul_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
ago_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
set_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
out_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
nov_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
dez_2013 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
months_2013 = pd.DataFrame([jan_2013,fev_2013,marc_2013,abr_2013,mai_2013,jun_2013,jul_2013,
                           ago_2013,set_2013,out_2013,nov_2013,dez_2013], columns=['Total'])
In [95]:
#2014
jan_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
fev_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
marc_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
abr_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
mai_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
jun_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
jul_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
ago_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
set_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
out_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
nov_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
dez_2014 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
months_2014 = pd.DataFrame([jan_2014,fev_2014,marc_2014,abr_2014,mai_2014,jun_2014,jul_2014,
                           ago_2014,set_2014,out_2014,nov_2014,dez_2014], columns=['Total'])
In [96]:
#2015
jan_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
fev_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
marc_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
abr_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
mai_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
jun_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
jul_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
ago_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
set_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
out_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
nov_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
dez_2015 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
months_2015 = pd.DataFrame([jan_2015,fev_2015,marc_2015,abr_2015,mai_2015,jun_2015,jul_2015,
                           ago_2015,set_2015,out_2015,nov_2015,dez_2015], columns=['Total'])
In [97]:
#2016
jan_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
fev_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
marc_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
abr_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
mai_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
jun_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
jul_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
ago_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
set_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
out_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
nov_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
dez_2016 = c.execute('''SELECT count(timestamp) FROM nodes WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
months_2016 = pd.DataFrame([jan_2016,fev_2016,marc_2016,abr_2016,mai_2016,jun_2016,jul_2016,
                           ago_2016,set_2016,out_2016,nov_2016,dez_2016], columns=['Total'])

3.4.2.2.2 Dados way's e datas

In [98]:
#2007
janw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
fevw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
marcw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
abrw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
maiw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
junw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
julw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
agow_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
setw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
outw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
novw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]
dezw_2007 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2007';''').fetchall()[0][0]

months_w2007 = pd.DataFrame([janw_2007,fevw_2007,marcw_2007,abrw_2007,maiw_2007,junw_2007,julw_2007,
                           agow_2007,setw_2007,outw_2007,novw_2007,dezw_2007], columns=['Total'])
In [99]:
#2008
janw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
fevw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
marcw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
abrw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
maiw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
junw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
julw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
agow_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
setw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
outw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
novw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]
dezw_2008 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2008';''').fetchall()[0][0]

months_w2008 = pd.DataFrame([janw_2008,fevw_2008,marcw_2008,abrw_2008,maiw_2008,junw_2008,julw_2008,
                           agow_2008,setw_2008,outw_2008,novw_2008,dezw_2008], columns=['Total'])
In [100]:
#2009
janw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
fevw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
marcw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
abrw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
maiw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
junw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
julw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
agow_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
setw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
outw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
novw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]
dezw_2009 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2009';''').fetchall()[0][0]

months_w2009 = pd.DataFrame([janw_2009,fevw_2009,marcw_2009,abrw_2009,maiw_2009,junw_2009,julw_2009,
                           agow_2009,setw_2009,outw_2009,novw_2009,dezw_2009], columns=['Total'])
In [101]:
#2010
janw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
fevw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
marcw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
abrw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
maiw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
junw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
julw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
agow_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
setw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
outw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
novw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]
dezw_2010 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2010';''').fetchall()[0][0]

months_w2010 = pd.DataFrame([janw_2010,fevw_2010,marcw_2010,abrw_2010,maiw_2010,junw_2010,julw_2010,
                           agow_2010,setw_2010,outw_2010,novw_2010,dezw_2010], columns=['Total'])
In [102]:
#2011
janw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
fevw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
marcw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
abrw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
maiw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
junw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
julw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
agow_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
setw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
outw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
novw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]
dezw_2011 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2011';''').fetchall()[0][0]

months_w2011 = pd.DataFrame([janw_2011,fevw_2011,marcw_2011,abrw_2011,maiw_2011,junw_2011,julw_2011,
                           agow_2011,setw_2011,outw_2011,novw_2011,dezw_2011], columns=['Total'])
In [103]:
#2012
janw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
fevw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
marcw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
abrw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
maiw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
junw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
julw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
agow_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
setw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
outw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
novw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]
dezw_2012 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2012';''').fetchall()[0][0]

months_w2012 = pd.DataFrame([janw_2012,fevw_2012,marcw_2012,abrw_2012,maiw_2012,junw_2012,julw_2012,
                           agow_2012,setw_2012,outw_2012,novw_2012,dezw_2012], columns=['Total'])
In [104]:
#2013
janw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
fevw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
marcw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
abrw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
maiw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
junw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
julw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
agow_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
setw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
outw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
novw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]
dezw_2013 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2013';''').fetchall()[0][0]

months_w2013 = pd.DataFrame([janw_2013,fevw_2013,marcw_2013,abrw_2013,maiw_2013,junw_2013,julw_2013,
                           agow_2013,setw_2013,outw_2013,novw_2013,dezw_2013], columns=['Total'])
In [105]:
#2014
janw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
fevw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
marcw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
abrw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
maiw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
junw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
julw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
agow_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
setw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
outw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
novw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]
dezw_2014 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2014';''').fetchall()[0][0]

months_w2014 = pd.DataFrame([janw_2014,fevw_2014,marcw_2014,abrw_2014,maiw_2014,junw_2014,julw_2014,
                           agow_2014,setw_2014,outw_2014,novw_2014,dezw_2014], columns=['Total'])
In [106]:
#2015
janw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
fevw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
marcw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
abrw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
maiw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
junw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
julw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
agow_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
setw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
outw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
novw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]
dezw_2015 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2015';''').fetchall()[0][0]

months_w2015 = pd.DataFrame([janw_2015,fevw_2015,marcw_2015,abrw_2015,maiw_2015,junw_2015,julw_2015,
                           agow_2015,setw_2015,outw_2015,novw_2015,dezw_2015], columns=['Total'])
In [107]:
#2016
janw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '01' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
fevw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '02' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
marcw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '03' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
abrw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '04' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
maiw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '05' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
junw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '06' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
julw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '07' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
agow_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '08' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
setw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '09' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
outw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '10' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
novw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '11' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]
dezw_2016 = c.execute('''SELECT count(timestamp) FROM ways WHERE strftime('%m',timestamp) = '12' AND 
strftime('%Y',timestamp) = '2016';''').fetchall()[0][0]

months_w2016 = pd.DataFrame([janw_2016,fevw_2016,marcw_2016,abrw_2016,maiw_2016,junw_2016,julw_2016,
                           agow_2016,setw_2016,outw_2016,novw_2016,dezw_2016], columns=['Total'])

3.4.2.2.3 Consolidação dos dados relativos as frequências de colaboração por meses do ano

In [108]:
#SOMANDO OS VALORES OBTIDOS A PARTIR DOS NODES E WAYS
total_2007 = months_w2007 + months_2007
total_2008 = months_w2008 + months_2008
total_2009 = months_w2009 + months_2009
total_2010 = months_w2010 + months_2010
total_2011 = months_w2011 + months_2011
total_2012 = months_w2012 + months_2012
total_2013 = months_w2013 + months_2013
total_2014 = months_w2014 + months_2014
total_2015 = months_w2015 + months_2015
total_2016 = months_w2016 + months_2016
In [132]:
#Consolidar os dados das contribuições
consolidado_meses = pd.concat([total_2007,total_2008,total_2009,total_2010,total_2011,total_2012,
                              total_2013,total_2014,total_2015,total_2016])
lista_meses = [1,2,3,4,5,6,7,8,9,10,11,12] * 10
anos = [2007] * 12 + [2008] * 12 + [2009] * 12 + [2010] * 12 + [2011] * 12 + [2012] * 12 + [2013] * 12 + [2014] * 12 + [2015] * 12 + [2016] * 12

consolidado_meses['mes'] = lista_meses
consolidado_meses['ano'] = anos
In [110]:
#Uma breve visualização no dataframe criado
consolidado_meses.tail()
Out[110]:
Total mes ano
7 9572 8 2016
8 4865 9 2016
9 5976 10 2016
10 1265 11 2016
11 0 12 2016
In [111]:
#Plotar um heatmap nos dados do dataframe consolidado_meses
sns.set(style="white", context='talk')
a1 = consolidado_meses.pivot('mes','ano','Total')
ax = plt.axes()
ax.set_title('Número de Constribuições por Período')
heatmap = sns.heatmap(a1, ax=ax, cmap="YlGnBu",annot=True, fmt="d", linewidths=.5, yticklabels=['Janeiro','Fevereiro','Março','Abril','Maio','Junho','Julho','Agosto','Setembro','Outubro','Novembro','Dezembro'])

Pelos dados consolidados no heatmap acima é possível constatar que a plataforma OpenStreetMap foi se consolidar em Brasília a partir de 2011, além do que em 2014 foi o ano que registrou o maior número de constribuições. O mapa também mostra que não existe um padrão para o mês de maior atividade ao longo do ano, todavia, os meses de junho a julho e de novembro a janeiro possuem significativos registros de atividade na plataforma, talvez esse padrão possa estar relacionado com os meses comuns a férias escolares onde as pessoas possuem mais disponilidade para usar a ferramenta.

3.4.2.3 Os usuários mais ativos na plataforma

Vamos determinar que são os usúarios mais ativos do site na cidade de Brasília.

In [115]:
most_active_users = c.execute("SELECT u.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) u GROUP BY u.user ORDER BY num DESC LIMIT 10;").fetchall()
most_active_users
Out[115]:
[('erickdeoliveiraleal', 129890),
 ('Linhares', 44317),
 ('street0501', 40615),
 ('MAPconcierge', 38185),
 ('teste18', 29044),
 ('wille', 17310),
 ('Rusleykcruz', 13741),
 ('woodpeck_repair', 13738),
 ('jadson_reis', 13081),
 ('charliekowacks', 12848)]

O usuário erickdeoliveiraleal é de longe o mais ativo dos usuários, totalizando um todo de 129890 submissões na plataforma. Como mostra o cálculo abaixo, isso representa aproxidamente 24% de todas as constribuições realizadas na cidade de Brasília!

In [118]:
most_active_users[0][1]/consolidado_meses['Total'].sum()
Out[118]:
0.23842258195374724
In [120]:
top1_pathway = pd.DataFrame(c.execute('''SELECT lat,lon from nodes where user ='erickdeoliveiraleal'  ''').fetchall(), columns=['lat','lon'])
In [124]:
#Armazenar as latitudes e longitudes em listas
lat_top1 = list(top1_pathway['lat'])
lon_top1 = list(top1_pathway['lon'])

O mapa abaixo é possível visualizar a impressão digital de colaboração do usuário erickdeoliveiraleal ao longo da região de Brasília. Como se trata de heatmap, as regiões em tonalidade mais avermelhada são as de maior atividade. A região do plano piloto concentra as atividades do usuário, bem como as regiões administrativas à oeste do plano, que também são as mais populosas da grande Brasília.

In [131]:
from folium import plugins
bsb_hp = folium.Map(location=[-15.7665, -47.9471], tiles='stamentoner',zoom_start=9)
bsb_hp.add_children(plugins.HeatMap(zip(lat_top1, lon_top1), radius = 15))
Out[131]:

3.5 Conclusão

O presente trabalho apresentou um trabalho de análise de dados com ciclo completo, desde a aquisição dos dados, tratamento destes, bem como alimentação de um banco de dados e posteriomente a exploração dos dados para obtenção de conclusões e insights a partir dos dados consolidados. Dessa maneira, o projeto proporcionou um aprendizado em situações de "vida-real" que um analista de dados pode encontrar em sua vida profissional.